One in three adults do not get enough sleep everyday in the United States1. Sleep is extremely important for one’s quality of life, much like eating healthy. However, there is an increasing incidence of people not obtaining enough sleeping on a daily basis. It is known that sleep deficiency can lead to both chronic mental and physical health problems, including increased risk of heart disease and depression2. To better understand how sleep deprivation can affect one’s mental health, and as a student who often suffers from having sleep deficiency, I specifically wanted to look at how sleep deprivation is affecting cognition. Using the transcriptomic data generated by the authors of the paper Altered hippocampal transcriptome dynamics following sleep deprivation3, I investigate how sleep deprivation alters the expression of genes involved in learning and memory. There have been studies that show significant associations between sleep deprivation and neurological disorders4, and accordingly, I hypothesized that sleep deprivation significantly would alter the expression of key genes associated with cognitive function and memory.
FastQC was run to check the quality of the fastq files. The results showed adapter contamination and this issue was resolved using TrimGalore. After running FastQC again after trimming, there were still a few warnings and failed tests. However, the overall quality of the samples was good enough to carry out subsequent analyses. Following quality control analyses, alignment of the samples was done using STAR. Majority of the reads were seen to be uniquely mapped and the scores were good, which allowed for differential gene expression (DGE) analysis.
DGE was done using DESeq2 and was followed up with gene ontology (GO) term enrichment analysis. This allowed for the identification of significant biological processes and cellular components that were affected by sleep deprivation. Less stringent alpha and p-values were used (0.1) as this is an exploratory analysis. The figures refered to in the following paragraphs are in the DGE Analysis and GSEA sections under Methods.
Using a Revigo TreeMap to visualize the significant GO terms, it was seen that processes involved in mRNA metabolic processes, protein dephosphorylation, and cellular response to growth factors were identified. These processes are highly related in brain and memory. In particular, neurons rely heavily on complex RNA splicing allowing for plasticity and learning. The misregulation of this is also associating with cognitive impairment. Growth factors are also very important in long-term memory formation and activate signaling pathways that regulate transcription. For the significant cell component terms, again regulation of transcription is highlighted as the spliceosomal complex, chromatin, and other transcription factor complexes were identified as significant. In particular, the CREB1 transcription factor is known to play a key role in memory formation. Chromatin and chromosomes also play an important role in memory consolidation and prevention of neuronal degeneration.
A GSEA analysis revealed activation of genes that play a part in regulation of lipid storage and forebrain neuronal development. This finding was particularly interesting as it appears that sleep deprivation significantly affects genes involved in forebrain neuron development, which plays an important role in learning and memory. The regulation of lipid storage is likely affected due to an inflammatory response induced by the sleep deprivation. The GSEA analysis also revealed suppression of pathways related to the eye and lens, which don’t seem to relate much with the question we are investigating but may be a downstream effect of disregulation in an different pathway.
Limitations of dataset
The sample size of this dataset is quite small (18 samples) and is collected at only one time point. Increasing the sample size and collecting more samples across different time points could enhance the differences between the condition groups.
Additionally, even after trimming, the FastQC results showed that there was quite a lot of sequence duplication and the per base sequence content was poor. Both these properties can reduce the statistical power making it harder to identify truly differentially expressed genes and their pathways.
One of the main limitations of this investigation is the dataset itself. Due to the characteristics of bulk RNA-seq, discerning cell types is not possible. Single-cell RNA-seq or spatial transcriptomics could potentially provide more statistical power by providing the ability to assess more cell-specific gene expression changes. Additionally, the amount of sleep deprivation could alter the resultant response drastically. To adress this, samples taken over a series of time points could provide more information about how the level of sleep deprivation affects the transcriptome. While interpreting the enriched pathways is a good way to infer the changes due to sleep deprivation, it would be better to have some more behavioral data assessing mice’s cognitive changes like a learning or memory test after a period of sleep deprivation.
The GO terms looked at comprise of all the GO terms available for the species. A more concentrated approach focused on gene sets specific to the brain could improve the analysis. Furthermore, to corroborate the changes in the transcript level, wet lab experiments can confirm the changes in these levels to further confirm these findings.
Future Directions
The data for this project is obtained from the NCBI Gene
Expression Omnibus (GEO) under the accession number
GSE166831. The authors of the publication “Altered
hippocampal transcriptome dynamics following sleep deprivation”5 from the
Molecular Brain journal generated this data at the Genomics
Division at the Iowa Institute of Human Genetics.
RNA extraction
Whole hippocampi were removed and flash frozen on dry ice from mice (Mus musculus). The samples were homogenized in Qiazol (Invitrogen) and using chloroform, were phase separated. This was followed by centrifugation at 14,000g for 15 minutes. Then, RNA was extracted using the RNeasy kit (Qiagen). Any DNA contamination was removed with RNase-Free DNase (Qiagen). Nanodrop 1 and Agilent Bioanalyzer were used to assess the quality of the samples after resuspending them in RNase-free water. Only samples that had an OD 260/280 and OD 260/230 ratio ~2.0 along with an RNA integrity number (RIN) of >8 were selected for subsequent library preparation.
Library Preparation
The Illumina TruSeq Stranded Total RNA sample prep kit was used to create the sequencing libraries and they were prepared for sequencing using standard Illumina protocols. Ribo-Zero Gold was used to enrich for mRNA through ribo-depletion as this method removes ribosomal RNA. Library concentrations were measured using the KAPA Illumina Library Quantification Kit (KAPA Biosystems). Paired-end sequencing (150 bp paired-end reads) was done in two lanes and 18 samples were sequenced in two batches.
Cell type
The mouse strain used was C57BL/6j and the entire hippocampus was used for this experiment. Accordingly, the total RNA is extracted from cells of hippocampal tissue samples, which mainly comprise of pyramidal neurons, granule cells, and mossy cells.
Experimental condition
The mice were either under the condition where their sleep was deprived by moving and tapping their cage (gentle handling method) or under the condition where their sleep was not disturbed.
Sequencing Platform
The Illumina HiSeq 4000 was used to do the sequencing.
The FASTQ files are described here: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA702096.
There are 18 samples for this project and their unique accession run
IDs are obtained from the link above by clicking on the Accession List
button, which downloads the list of IDs locally. I created a file
PRJNA702096.txt containing this list of IDs, and the
relevant FASTQ files were downloaded using SRA-toolkit from this list.
Below is the script used to do this (on github:
load_data.sh).
#!/bin/bash
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=125G
#SBATCH --time=4:00:00
cat PRJNA702096.txt | xargs -I {} -P 16 bash -c 'echo -e "Downloading FASTQ file for sample: {} \n"; prefetch {}; fastq-dump --split-files --gzip {}/'
FastQC was run on the samples and from the output.
#!/bin/bash
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --mem=160G
#SBATCH --time=8:00:00
cat PRJNA702096.txt | xargs -n 1 -P 18 bash -c '
run_fastqc() {
sample_id=$1
r1="fastqs/${sample_id}.fastq.gz"
r2="fastqs/${sample_id}.fastq.gz"
fastqc $r1 --extract -o fastqc_output/${sample_id}/
fastqc $r2 --extract -o fastqc_output/${sample_id}/
}
run_fastqc "$0"'
mamba activate multiqc
multiqc fastqc_output/*
MultiQC was run to collate the results of the FastQC run on each sample and below are some plots generated from the same.
From the output it is evident that the fastq files contain adapter sequences and the adapters need to be trimmed. The per base sequence content test also failed for most samples, which is likely due to the adapter contamination. The test for sequence duplication levels also failed for many samples, suggesting that there are many copies of a single read from over-amplification. There are warnings about potential excessive overrepresented sequences. This could mean that there are sequence repeated in the genome or potential contamination. The adapter content test also failed, as expected from the other failed tests.
TrimGalore was used to solve the adapter contamination issue. Below
is the script used (run_trim_galore.sh).
#!/bin/bash
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --mem=160G
#SBATCH --time=8:00:00
cat PRJNA702096.txt | xargs -n 1 -P 18 bash -c '
run_trimming() {
sample_id=$1
r1="${sample_id}_1.fastq.gz"
r2="${sample_id}_2.fastq.gz"
trim_galore --paired --illumina --cores 4 -o trimmed/ $r1 $r2
}
run_trimming "$0"'
Script used to run FastQC after trimming is shown below
(run_fastqc.sh).
#!/bin/bash
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --mem=160G
#SBATCH --time=8:00:00
cat PRJNA702096.txt | xargs -n 1 -P 18 bash -c '
run_fastqc() {
sample_id=$1
r1="trimmed/${sample_id}_1_val_1.fq.gz"
r2="trimmed/${sample_id}_2_val_2.fq.gz"
fastqc $r1 --extract -o fastqc_output_trimmed/${sample_id}/
fastqc $r2 --extract -o fastqc_output_trimmed/${sample_id}/
}
run_fastqc "$0"'
mamba activate multiqc
multiqc fastqc_output_trimmed/*
MultiQC was run again to collate the results and some plots from it are shown below.
The adapter contamination was
removed, which was the most important issue. The overrepresented
sequences could be due to highly expressed sequences and the failed
tests for sequence duplication levels could be due to overamplification.
The variability in sequence length distribution is due to the trimming
so it also not too concerning. While there were still some warnings and
failed tests, since they seemed to apply to most samples they are not of
high concern. The samples are ready to be aligned now.
First, the genome and annotation data is obtained to create the index
required to align the reads. The genome is downloaded from the UCSC
Genome Browser:
https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
The annotation is also downloaded from the UCSC Genome Browser:
https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/genes/mm10.ncbiRefSeq.gtf.gz
Using STAR, the index is built for the mm10 mouse strain.
The parameters used are:
–runMode: This parameter is set as genomeGenerate to
build the genome index.
–runThreadN: This parameter is set to 16 to run
faster (on 16 CPUs).
–genomeDir: This parameter is set to mm10_STARindex,
which points to the directory where the index will be stored.
–genomeFastaFiles: This parameter points to the genome FASTA
file, which is mm10.fa.
–sjdbGTFfile: This parameter points to the gene annotation file,
which is mm10.gtf.
–sjdbOverhang: This parameter is set to 149 because
the read length is 150 and the optimal spliced alignment is the read
length - 1 = 150 - 1 = 149. This parameter allows more accurate
detection of splice junctions with respect to the length of the read and
is more suited for RNA-seq.
wget "https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz"
gunzip mm10.fa.gz
wget "https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/genes/mm10.ncbiRefSeq.gtf.gz"
gunzip mm10.ncbiRefSeq.gtf.gz
STAR --runMode genomeGenerate \
--runThreadN 16 \
--genomeDir mm10_STARindex \
--genomeFastaFiles mm10_files/mm10.fa \
--sjdbGTFfile mm10_files/mm10.ncbiRefSeq.gtf \
--sjdbOverhang 149
Now that the index has been created, the alignment can be run. STAR is used to run the alignment as it is optimized for RNA-seq datasets whereas BWA is better used for DNA-seq like whole genome sequencing.
The parameters used are:
–runMode: This parameter is set as alignReads to
align the reads to the genome.
–runThreadN: This parameter is set as 18 to run the
alignment faster (on 18 CPUs).
–genomeDir: This parameter is set to mm10_STARindex,
which points to the directory where the index is stored.
–readFilesIn: This parameter points to the paired-read files to
be aligned SRR13720930_2_val_2.fq.gz and
SRR13720930_2_val_2.fq.gz.
–readFilesCommand: This parameter is set to zcat to
decompress the files as the alignment runs.
–outFileNamePrefix: This parameter specifies the prefix of the
output files as
, which isalignments/SRR13720930.to store the alignment output results in the directoryalignments`.
–outSAMtype: This parameter is set to
BAM SortedByCoordinate to output the aligned reads as a BAM
file and sorted by genomic coordinates.
–limitBAMsortRAM: This parameter increases the allocated memory
to sort the BAM file to 100GB (100000000000).
Below is the script used to run all the alignments
(run_alignments.sh).
#!/bin/bash
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --mem=100G
#SBATCH --time=4:00:00
for line in $(cat PRJNA702096.txt); do
r1="trimmed/${line}_1_val_1.fq.gz"
r2="trimmed/${line}_2_val_2.fq.gz"
STAR --runMode alignReads --runThreadN 16 \
--genomeDir ../project_old/mm10_STARindex/ \
--readFilesIn $r1 $r2 \
--readFilesCommand zcat \
--outFileNamePrefix alignments/${line}/${line}. \
--outSAMtype BAM SortedByCoordinate \
--limitBAMsortRAM 100000000000
done
To better identify samples by their condition group and batch I renamed them using the script below.
NSD refers to the condition group where there was no
sleep deprivation induced in the mice.
SD refers to the condition group where the mice were
sleep deprived.
batch1 and batch2 are the labels of the
two batches.
#!/bin/bash
while IFS=, read -r run group; do
old_dir="alignments/${run}/"
new_dir="alignments/${group}/"
for file in "$old_dir"*; do
filename=$(basename "$file")
extension="${filename#${run}}"
new_filename="${group}${extension}"
mv ${file} ${new_dir}${new_filename}
done
done < sra_run_id_to_group.csv
rm -r alignments/SRR*
After running STAR, MultiQC was run to get an overview of the percentage of reads mapped and the alignment scores. Below are the plots generated.
To understand how the mapped reads are
distributed over the different features in the genome RSeQC was run.
Below is the script used to run it (
run_rseqc.sh).
#!/bin/bash
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --mem=100G
#SBATCH --time=8:00:00
for SAMPLE in alignments/*/*.bam; do
samtools index $SAMPLE
done
for SAMPLE in alignments/*/*.bam; do
echo $SAMPLE >> bam_list.txt
done
for SAMPLE in alignments/*/*.bam; do
SAMPLE_NAME=$(basename ${SAMPLE} .Aligned.sortedByCoord.out.bam)
read_distribution.py -i ${SAMPLE} \
-r ../project_old/mm10_files/mm10_RefSeq.bed > \
rseqc_results/${SAMPLE_NAME}_read_distribution.txt
done
Majority of the reads are exons, mainly CDS
exons, which is good. There is also a large proportion of reads mapped
to intronic regions. This could be due to gDNA contamination. However,
due to alternative splicing, some introns may be retained, which could
also contribute to the these mapped reads.
QoRTs was also run for further QC. QoRTs was chosen over RSeQC as it allows for comparisons to be made between batches and between the condition groups.
Below is the script used to run QoRTs (run_qorts.sh).
The decoderFile.txt was manually generated (and is
available to view on github) to create the MultiQC plots from the
indivisual QoRTs output.
#!/bin/bash
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --mem=200G
#SBATCH --time=8:00:00
for SAMPLE in alignments/*/*.bam; do
echo $SAMPLE >> bam_list.txt
done
cat bam_list.txt | xargs -n 1 -P 18 bash -c '
run_qorts() {
SAMPLE=$1
SAMPLE_NAME=$(basename "$SAMPLE" .Aligned.sortedByCoord.out.bam)
java -Xmx4G -jar /athena/angsd/scratch/mef3005/share/envs/qorts/share/qorts-1.3.6-1/QoRTs.jar QC \
--stranded \
--generatePdfReport \
${SAMPLE} \
../project_old/mm10_files/mm10.ncbiRefSeq.gtf \
qorts_results/$SAMPLE_NAME/
}
run_qorts "$0"'
Rscript /athena/angsd/scratch/mef3005/share/envs/qorts/share/qorts-1.3.6-1/qortsGenMultiQC.R qorts_results/ decoderFile.txt multiqc/
Below are the plots generated after running the Rscript
qortsGenMultiQC.R mentioned in the script above.
Comparing the two condition groups NSD and SD
Comparing the two batches